Estimating Rarity and Similarity over Data

Authors

  • Mayur Datar
  • S Muthukrishnan
Abstract

In the windowed data stream model, we observe items arriving over time. At any time $t$, we consider the window of the last $N$ observations $a_{t-(N-1)}, a_{t-(N-2)}, \ldots, a_t$, each $a_i \in \{1, \ldots, u\}$; we are allowed to ask queries about the data in the window, for example, to compute the minimum or the median of the items in the window. A crucial restriction is that we are only allowed $o(N)$ (often polylogarithmic in $N$) storage space, that is, space smaller than the window size, so the items within the window cannot be archived. The windowed data stream model arose out of the need to reason formally about the underlying data analysis problems in applications such as internetworking and transaction processing. In this paper, we study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window. While previous work has studied simple distributional parameters such as the number of distinct items in the window, no prior work has addressed the general problem of estimating rarity. Our second problem is that of estimating the similarity between two data stream windows using the Jaccard coefficient. Prior work has focused on $L_p$ norms; set similarity measures such as the Jaccard coefficient have not been studied before in the windowed data stream model. The problems of estimating rarity and similarity have many applications in mining massive data. We present novel, simple algorithms for estimating rarity and similarity on windowed data streams, accurate up to a factor of $1 \pm \epsilon$ using space only logarithmic in the window size. In both cases, our solutions are based on modifications of the powerful min-wise hashing technique. We expect our solutions to find applications in practice.
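
The min-wise hashing idea that the abstract builds on can be illustrated with a minimal sketch. It is not the paper's windowed algorithm: it keeps every distinct item of each window in memory, so it does not meet the $o(N)$ space bound, and it only shows the basic estimator. It relies on the standard fact that, for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard coefficient $|A \cap B| / |A \cup B|$. The function names, the use of salted BLAKE2b as a stand-in for a family of random hash functions, and the default of 256 hash functions are all assumptions made for illustration.

```python
import hashlib


def item_hash(item, salt_index):
    """64-bit hash of `item` under the hash function numbered `salt_index`.

    Assumption: salted BLAKE2b stands in for a family of random hash functions.
    """
    salt = salt_index.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(window, num_hashes=256):
    """Minimum hash value of the window's distinct items, one per hash function.

    The window must be non-empty. Unlike a windowed stream algorithm, this
    materializes the whole set of distinct items.
    """
    distinct = set(window)
    return [min(item_hash(x, i) for x in distinct) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(window_a, window_b):
    """Exact Jaccard coefficient of the sets of distinct items, for comparison."""
    a, b = set(window_a), set(window_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    window_a = list(range(0, 200))
    window_b = list(range(100, 300))
    sig_a = minhash_signature(window_a)
    sig_b = minhash_signature(window_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(window_a, window_b))
```

Increasing `num_hashes` tightens the estimate at the cost of a longer signature; the standard deviation of the estimate shrinks roughly as the inverse square root of the number of hash functions.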

Similar resources

A novel method for detecting structural damage based on data-driven and similarity-based techniques under environmental and operational changes

The application of time series modeling and statistical similarity methods to structural health monitoring (SHM) provides promising and capable approaches to structural damage detection. The main aim of this article is to propose an efficient univariate similarity method, named Kullback similarity (KS), for identifying the location of damage and estimating the level of damage severity. An impr...

A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation

Recommender systems use information retrieval and machine learning techniques to filter information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in recommender systems based on collaborative filtering. In order to improve the accuracy of traditional user-based collaborative filtering techniques under the new-user cold-start problem a...

Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator

This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...

Ecological specialization and rarity indices estimated for a large number of plant species in France

The biological diversity of the Earth is being rapidly depleted due to the direct and indirect consequences of human activities. Specialist or rare species are generally thought to be more extinction prone than generalist or common species. Testing this assumption however requires that the rarity and ecological specialization of the species are quantified. Many indices have been developed to cl...

Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used for training the clusters is not well distributed. This imbalanced distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...

Publication date: 2002